A Survey of Retrieval Strategies for OCR Text Collections

Authors

  • Steven M. Beitzel
  • Eric C. Jensen
  • David A. Grossman
Abstract

The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.

Similar resources

Length Normalization in Degraded Text Collections

Optical character recognition (OCR) is the most commonly used technique to convert printed material into electronic form. Using OCR, large repositories of machine readable text can be created in a short time. An information retrieval system can then be used to search through large information bases thus created. Many information retrieval systems use sophisticated term weighting functions to im...

Querying Short OCR'd Documents

Studies have shown that OCR errors have little effect on average precision for full text collections. A question that was left unanswered from these studies was how OCR errors would affect short document collections. This issue was examined in this study using documents consisting of only titles and abstracts. The results of our experimentation are presented in this paper.

Retrieving OCR Text: A Survey of Current Approaches

The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.

Effects of OCR Errors on Ranking and Feedback Using the Vector Space Model

We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall are not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do, however, see divergence between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In ...

OCR correction based on document level knowledge

For over 10 years, the Information Science Research Institute (ISRI) at UNLV has worked on problems associated with the electronic conversion of archival document collections. Such collections typically have a large fraction of poor quality images and present a special challenge to OCR systems. Frequently, because of the size of the collection, manual correction of the output is not affordable....

Publication date: 2003